Amanda Birmingham, CCBB, UCSD (abirmingham@ucsd.edu)
Related Notebooks:
Because all downstream QIIME analyses depend on the meta-data in the mapping file (see Required Inputs section in Introduction to 16S Microbiome Primary Analysis), the first step in analysis is validating that the mapping file is formatted correctly. QIIME provdes a script for this purpose called validate_mapping_file.py
; full details are available at http://qiime.org/scripts/validate_mapping_file.html . In brief, the script takes in a path to the mapping file to validate and a directory path to which to write outputs:
validate_mapping_file.py -m [mapping_file_path] -o [directory_path]
as shown in this example:
validate_mapping_file.py -m inputs/925_merged_prep_mapping2.txt -o mapping_validation/
Note that this script cannot be run in parallel, so it will use only one node of the cluster; however, since it is a fast step this doesn't present a problem.
The script assesses the mapping file for various errors and then writes a log file, an html file, and a corrected mapping file to the output directory. If there are any issues found, you will also see a message like this:
Errors and/or warnings detected in mapping file. Please check the log and html file for details.
The HTML output and the log output both contain largely the same information, and I have not found it necessary to check both. When working on a cluster terminal, the log is easier to use; a simple call like
head mapping_validation/925_merged_prep_mapping.log
will show the top of the log file, which lists the errors and warnings found, as in this example:
# Errors and warnings are written as a tab separated columns, with the first column showing the error or warning, and the second column contains the location of the error or warning, written as row,column, where 0,0 is the top left header item (SampleID). Problems not specific to a particular data cell will be listed as having 'no location'.
Errors -----------------------------
Warnings ---------------------------
Invalid characters found in "This analysis was done as in Caporaso et al 2011 Genome research. The PCR primers (F515/R806) were developed against the V4 region of the 16S rRNA (both bacteria and archaea), which we determined would yield optimal community clustering with reads of this length using a procedure similar to that of ref. 15. [For reference, this primer pair ampli\_es the region 533\_786 in the Escherichia coli strain 83972 sequence (greengenes accession no. prokMSA\_id:470367).] The reverse PCR primer is barcoded with a 12-base errorcorrecting Golay code to facilitate multiplexing of up to ξ1,500 samples per lane, and both PCR primers contain sequencer adapter regions. The three sequencing primers include two for reading in from each end of the amplicon and a third for reading the barcode. Because of technical limitations at the sequencing facility, only part of the barcode was sequenced, so we were unable to exploit the error-correcting properties fully; however, even with partial barcodes we were able to resolve the samples, demonstrating the robustness of the approach. It is important to note that this primer collection allows for sequencing of paired-end reads, but the downstream data analyses are not yet capable of supporting paired-end reads. Our results illustrate interesting and correlated patterns based on analysis of the unpaired reads (i.e., \_ and \_ diversity evaluations based on the 5\_ only and 3\_ only reads independently achieve similar results, suggesting that 100 bases in this region of the 16S gene can allow for successful screening and comparison of microbial communities). The reads generated from these PCR primers are both identi\_ed as \_recommended��� regions by Liu et al. (15). Polymerase Chain Reaction. Sample preparation was performed similarly to that described by Costello et al. (1). Brie\_y, each sample was ampli\_ed in triplicate, combined, and cleaned using the MO BIO 96 htp PCR clean up kit. PCR reactions contained 13 \_L MO BIO PCR water, 10 \_L 5 Prime Hot Master Mix, 0.5 \_L each of the forward and reverse primers (10 \_M \_nal concentration), and 1.0 \_L genomic DNA. Reactions were held at 94 C for 3 min to denature the DNA, with ampli\_cation proceeding for 35 cycles at 94 ξC for 45 s, 50 ξC for 60 s, and 72 ξC for 90 s; a \_nal extension of 10 min at 72 ξC was added to ensure complete ampli\_cation. Cleaned amplicons were quanti\_ed using Picogreen dsDNA reagent in 10 mM Tris buffer (pH 8.0). A composite sample for sequencing was created by combining equimolar ratios of amplicons from the individual samples, followed by gel puri\_cation and ethanol precipitation to remove any remaining contaminants and PCR artifacts." 1,9
Invalid characters found in "This analysis was done as in Caporaso et al 2011 Genome research. The PCR primers (F515/R806) were developed against the V4 region of the 16S rRNA (both bacteria and archaea), which we determined would yield optimal community clustering with reads of this length using a procedure similar to that of ref. 15. [For reference, this primer pair ampli\_es the region 533\_786 in the Escherichia coli strain 83972 sequence (greengenes accession no. prokMSA\_id:470367).] The reverse PCR primer is barcoded with a 12-base errorcorrecting Golay code to facilitate multiplexing of up to ξ1,500 samples per lane, and both PCR primers contain sequencer adapter regions. The three sequencing primers include two for reading in from each end of the amplicon and a third for reading the barcode. Because of technical limitations at the sequencing facility, only part of the barcode was sequenced, so we were unable to exploit the error-correcting properties fully; however, even with partial barcodes we were able to resolve the samples, demonstrating the robustness of the approach. It is important to note that this primer collection allows for sequencing of paired-end reads, but the downstream data analyses are not yet capable of supporting paired-end reads. Our results illustrate interesting and correlated patterns based on analysis of the unpaired reads (i.e., \_ and \_
Even if there are no errors, it is a good idea to skim the types of warnings created in order to see what they were ... in my experience, invalid characters in text fields (like protocols) and incorrect column ordering (specifically, not having the description
column as the last column in the file) are the most common, and are successfully dealt with by the automated correction. As mentioned above, the HTML file contains much the same information but in a more visual format:
Errors require you as the analyst to fix whatever problem was found (and then, of course, rerun the validation). If only warnings are detected, chances are that you do not make any changes to the mapping file yourself but can simply use the corrected mapping produced by the script. However, do remember to reference the corrected mapping file in all subsequent script calls rather than the original one, since the validation script does not overwrite or remove the original, issue-containing file!
As described in the Background and Required Inputs sections of the 1 Introduction to 16S Microbiome Primary Analysis document, this SOP covers the analysis of 16S data sequenced using the three-read technique and employing Golay error-correcting barcodes. In this case, QIIME makes demultiplexing, or assigning each read to its source sample based on the barcode attached to it, very simple. Start by ensuring that you indeed have a fastq file containing the "real" sequences (i.e., the reads from the read primer) and a second fastq file containing the barcode reads (from the index primer); if you do not, chances are that the data was not generated with the three-read approach and will require more processing. Assuming you do, unzip any zipped fastqs with the gunzip command, as in this example:
gunzip *.fastq.gz
Now all that is necessary is a single call to split_libraries_fastq.py
, which, as its name implies, splits the reads from the sequenced library (collection of DNA fragments) up into one set for each sample, based on the barcodes assigned to each sample. It may be something of a surprise to learn that it also encapsulates all the quality control checking in the QIIME pipeline, which is done automatically "under the hood"! Quality checks performed by this method include:
The call is of the format
split_libraries_fastq.py -m [mapping file path] -i [sequences fastq file path] -b [barcode fastq file path] -o [output directory path]
as shown in this example:
split_libraries_fastq.py -m /data/yellowstone_gradients/mapping_validation/925_merged_prep_mapping_corrected.txt -i /data/yellowstone_gradients/s_6_1_sequences.fastq -b /data/yellowstone_gradients/s_6_1_sequences_barcodes.fastq -o /data/library_split/
As always, remember to use the corrected mapping file! Additionally, note that this step is also not parallelizable.
After a successful run, the output directory will contain a log file (split_library_log.txt
), a tab-delimited text file listing the distribution of reads by length (histograms.txt
), and--as the main goal--a fasta file containing the read sequences labeled by the sample from which they came (seqs.fna
). The log file is a good place to start in assessing the results; running head
on it will show the main findings, since they are reported at the top:
Input file paths
Mapping filepath: /data/yellowstone_gradients//mapping_validation/925_prep_98_qiime_20141124-091849_corrected.txt (md5: a4048567752a64487db5265bbf1043e6)
Sequence read filepath: /data/yellowstone_gradients//s_6_1_sequences.fastq (md5: 572afdd3b02343c64fee37b52f1123c2)
Barcode read filepath: /data/yellowstone_gradients//s_6_1_sequences_barcodes.fastq (md5: b9824c34b990751a15a5f7c1ce9f9354)
Quality filter results
Total number of input sequences: 15034373
Barcode not in mapping file: 6991839
Read too short after quality truncation: 658828
Count of N characters exceeds limit: 21
Illumina quality digit = 0: 0
Barcode errors exceed max: 888033
Result summary (after quality filtering)
Median sequence length: 90.00
925.in5x 46567
925.in4z 41714
925.io4x 41362
925.in5y 39376
925.sa4x 39169
925.im4z 38288
925.sc2x 37312
925.in4y 34720
925.sa2y 34691
925.io1x 34424
925.sb1z 34196
In this example, we see the total number of input sequences followed by summary statistics about reads lost to various quality filters. The large number with "Barcode not in mapping file" is not a concern in this case; as the name implies, it represents reads whose barcode is not listed in the mapping file for this experiment, and it is common for this to be large when sequencing libraries from multiple experiments are combined in the same (usually HiSeq) sequencing run. Simply, these are reads from other experiments that were sequenced in the same run. However, if you have reason to believe these are a concern given a particular experimental/run design, you can pass the --retain_unassigned_reads
switch to split_libraries_fastq.py
to have these included in the sequence output with the sample id "Unassigned" so that you can examine them.
After these summaries, the log contains a list of each sample and the number of quality-check-passing reads assigned to it, as well as (down at the bottom) the total number of reads that passed quality-checking. The histogram file also contains count information, but segregated by read length bin instead of sample id:
Length Count
69.0 455092
79.0 742271
89.0 5298289
99.0 0
--
As usual, it is difficult to set firm thresholds for what is "acceptable" quality and what is "unacceptable". However, evidence suggests that useful determinations can be made for even relatively low numbers (10s to 100s) of sequences per sample, as shown in this figure from "Direct sequencing of the human microbiome readily reveals community differences" (Kuczynski J, Costello EK, Nemergut DR, et al. Genome Biology. 2010;11(5):210, available online at http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2898070/):
Given this, if most of the samples have 100 or more reads after quality filtering, analysis is likely feasible.
In [ ]: